import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import nltk
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from IPython.display import HTML, display
import pprint
import warnings
warnings.filterwarnings('ignore')
import plotly.io as pio
pio.renderers.default='notebook'
# Create toggle cell button
font = "Roboto-Regular.ttf"
pp = pprint.PrettyPrinter(indent=4, width=100)
HTML('''
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
.output {
display: flex;
align-items: left;
text-align: justify;
}
</style>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>
''')
Unbeknownst to many, Python is not a young programming language. Guido van Rossum created Python in the late 1980s, about three decades ago, as a general-purpose programming language. This general-purpose design is a large part of why Python rose to fame in recent years. Programming is slowly becoming a democratized skill, with more people learning how to code each day. Python, with its beginner-friendly syntax, has been useful for a wide array of applications and has continuously prospered and evolved over the years, with new algorithmic breakthroughs and innovations being published in Python year after year.
With more people from diverse backgrounds entering programming through Python, a vast number of questions and ideas have been documented in the archives of Stack Overflow, as seasoned programmers and new learners alike flock to the platform in search of answers to their programming problems. In this analysis, our objective is to characterize the growth of Python over the last decade by looking at the diversity of posts on Stack Overflow. Through this study, we want to understand how exactly Python has evolved and, from our findings, to see the direction in which Python programming is headed.
Stack Exchange, or more specifically Stack Overflow, has served as the go-to Q&A platform for a majority of programmers in their coding endeavours. Founded in 2008, Stack Overflow has built a community of developers, programmers, and learners and has documented their failures, breakthroughs, and ideas over the years. For the scope of our study, we look into the varying themes of questions collected on the platform over the past decade (2010-2019). Specifically, we look into topics involving Python programming and apply dimensionality reduction techniques to identify themes among the thousands of tags related to Python. A total of 22,007 unique tags were obtained from over 4 million Python-related posts.
Our results show that the Python community has grown continuously over the past decade: taking the number of Python questions in 2010 as the reference point, we see an 800% increase in the total number of posts from 2010 to 2019. Along with the increase in the total number of questions about Python programming, the diversity of topics discussed has increased as well, from fewer than 1,500 unique tags in 2010 to approximately 4,500 unique tags in 2019.
Through our analysis, we explored how frequently certain tags appeared over the last decade. We found that Python users in the early part of the 2010 to 2019 period primarily discussed Django web development, which dropped in popularity in 2015 in favor of data analysis tools such as pandas and tensorflow. These rose in popularity along with machine-learning and deep-learning in the latter part of the decade. The records of Stack Overflow also captured an interesting trend in the migration of users from Python 2.X to Python 3.X, centering around the end of life of Python 2.7 in 2020.
Python has evolved and changed over the years, but our analysis found a consistent list of topics that have been the general themes of discussion about Python. The team found that the Python community on Stack Overflow has frequently discussed these 12 general topics over the last decade: Generic Python, Django Framework, Web Apps, Python Data Types, Web Scraping, GUI Tools, Python-2.7, Computations Tools, Python-3.x, Data Transformation, Deep Learning, and Machine Learning.
StackOverflow, in recent years, has become the preferred site for veteran programmers and new coders for programming-related questions. Over the years, it has served as a site where programmers at various stages of their careers gather to ask, explore, and guide their fellow programmers in their algorithmic endeavours. With programming becoming more accessible and simpler to learn, more people are attracted towards StackOverflow either through sheer curiosity, to ask for help, or maybe even to lend a helping hand.
With languages such as Python paving the way for more readable code and fostering open collaboration among users, people have diversified their questions about Python programming and its related topics. In this study, the group wants to identify how these topics have changed over the years and to pick out key topics at different points in time, so as to find gaps in the supply and demand of information and to recommend possible interventions that may assist the platform.

Stack Overflow is a popular question-and-answer website in the Stack Exchange network designed to cater specifically to programming-related questions. Stack Overflow boasts over 100 million monthly users and millions of posts since it began in 2008. In this analysis, we will be looking at the journey of Python from 2010 to 2019 to see how it has changed through the years.
# Visualization of python counts in Stack Overflow over time
df_month = pd.read_csv("python_counts.csv").set_index("month")
fig, ax1 = plt.subplots(figsize=(12, 5), dpi=80)
ax2 = ax1.twinx()
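# Colors are set only through `color=`; the old 'g-' / 'b-' format strings
# specified a second, conflicting color
ax1.plot(df_month.index, df_month['docs'], '-', marker='.', color='#FFD43B')
ax2.plot(df_month.index, df_month['tags'], '-', marker='.', color='#306998')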
ax1.set_xlabel('Year')
ax1.set_ylabel('Nos. of Questions',
color='#FFD43B',
fontsize=14)
ax1.set_xticks(np.append(np.arange(0, 120, 12), 119), [])
ax2.set_ylabel('Nos. of Tags', color='#306998', fontsize=14)
plt.title("Python's Growth in the Last 10 Years")
plt.show()
From 2010 to 2019, Stack Overflow accumulated 4,117,394 questions tagged with Python. In Figure 1, we can observe that Python was averaging fewer than 10,000 questions per month in 2010. From this point onward, the number of questions has increased steadily and has not lost momentum.
Not only has the number of questions been increasing but also the number of tags. A total of 22,007 unique tags were obtained for the period of 2010 to 2019. Similar to the total number of questions, the number of tags has also followed an increasing trend. This suggests that the user base of Python is not only getting larger but that the uses for Python may also be getting more diverse, given the constant increase in the tags that people use.
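As a rough illustration of how monthly counts such as the `docs` and `tags` columns plotted above might be derived, the sketch below counts questions and distinct tags per month with pandas. The raw `posts` table, its `created` and `tags` columns, the `|` separator, and the `monthly_counts` helper are all assumptions made for this example; the actual preprocessing behind `python_counts.csv` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw data: one row per question, tags joined with '|'
posts = pd.DataFrame({
    "created": pd.to_datetime(
        ["2010-01-05", "2010-01-20", "2010-02-03", "2010-02-14"]),
    "tags": ["python|django", "python|list",
             "python|django|regex", "python|numpy"],
})


def monthly_counts(posts):
    """Count questions (docs) and distinct tags (tags) per calendar month."""
    df = posts.assign(month=posts["created"].dt.to_period("M").astype(str))
    # Number of questions posted in each month
    docs = df.groupby("month").size().rename("docs")
    # One row per (question, tag) pair, then distinct tags per month
    exploded = df.assign(tag=df["tags"].str.split("|")).explode("tag")
    tags = exploded.groupby("month")["tag"].nunique().rename("tags")
    return pd.concat([docs, tags], axis=1)


counts = monthly_counts(posts)
print(counts)
```

The output is indexed by `month` and has the same two columns that the plotting cell reads, so a table built this way could be fed to that cell directly.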
# Visualization of python tags in Stack Overflow over time
df_tfidf = pd.read_csv('words_rank.csv')
highlight = alt.selection(type='single', on='mouseover', bind='legend',
fields=['Term'], nearest=True, empty="none")
chart = alt.Chart(df_tfidf,
title="Top Tags Across the Years").mark_line(
).encode(
x=alt.X('Year:O'),
y=alt.Y('Rank',
scale=alt.Scale(domain=[10.5, 0.9]),
axis=alt.Axis(tickCount=10,
labelColor=alt.condition('datum.value < 1 |'
'datum.value > 10' ,
alt.value('white'),
alt.value('black')))),
color=alt.Color("Term",
legend=alt.Legend(title="Term")),
# strokeDash=alt.Stroke('Term',
# legend=None),
size=alt.condition(~highlight, alt.value(1), alt.value(5))
).properties(
width=780,
height=400
)
points = alt.Chart(df_tfidf).mark_point(opacity=1
).encode(
x=alt.X('Year:O'),
y='Rank',
color=alt.Color("Term")
).add_selection(
highlight
)
layer = chart + points
layer = layer.configure_axis(
grid=False
).configure_point(
size=5
).configure_title(
fontSize=18,
offset=5,
orient='top',
anchor='middle')
layer
We then looked at the trend of the most popular tags over time. The TF-IDF score of each tag was calculated for each year and plotted in Figure 2. Django dominated the tags from 2010 to 2015 before dropping to third place in 2016. Another popular tag in 2010 was google-app-engine, which ranked 2nd that year. However, this was short-lived, with google-app-engine failing to make the top 10 most common tags from 2013 onwards. List, regex, string, and windows also showed a decreasing trend in TF-IDF score starting from 2010.
Replacing these tags are newer Python packages. Pandas first appeared in the top 10 in 2013 and rose until it peaked at rank 1 in 2016. From 2017 to 2019, pandas remained the 2nd most used tag associated with Python. Dataframe is another tag that entered the top 10 in 2016, while tensorflow joined the top ranks in 2017. Meanwhile, numpy and matplotlib have consistently been in the top 10 in recent years.
We can also follow the trend of Python versions. Python-2.7 first appeared in the top 10 tags in 2012 and was the 2nd most used tag from 2013 to 2015 before it began to decline. Python-3.x was already one of the top 10 tags in 2011 but peaked in 2018, when it became the most popular tag associated with Python. This makes sense given that Python 3.0 was first released in 2008 while Python 2.7 was released in 2010. As Python 3.x matured, its number of users grew, as reflected by the questions on Stack Overflow. The decline of Python-2.7 may be attributed to people shifting to Python-3.x.
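The yearly tag ranking discussed above can be sketched in a few lines. This is only one plausible reading of "TF-IDF score of each tag per year": each question's tag list is treated as a document, the inverse document frequency is taken as log(N / df), and a tag's yearly score is the sum of its TF-IDF weights over that year's questions. The sample data, the exact weighting, and the `yearly_tfidf_ranks` helper are assumptions for illustration, not the pipeline that produced `words_rank.csv`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample: (year, tags attached to one question)
questions = [
    (2010, ["python", "django"]),
    (2010, ["python", "django", "google-app-engine"]),
    (2010, ["python", "regex"]),
    (2019, ["python", "pandas"]),
    (2019, ["python", "pandas", "dataframe"]),
    (2019, ["python", "tensorflow"]),
]


def yearly_tfidf_ranks(questions):
    """Return {year: [tags ordered from highest to lowest summed TF-IDF]}."""
    n_docs = len(questions)
    # Document frequency: number of questions that carry each tag
    doc_freq = Counter(tag for _, tags in questions for tag in set(tags))
    idf = {tag: math.log(n_docs / df) for tag, df in doc_freq.items()}
    scores = defaultdict(Counter)
    for year, tags in questions:
        for tag in tags:
            # Term frequency of a tag within one question is 1 / len(tags)
            scores[year][tag] += idf[tag] / len(tags)
    # Sort by descending score, breaking ties alphabetically
    return {
        year: [tag for tag, _ in
               sorted(tag_scores.items(), key=lambda kv: (-kv[1], kv[0]))]
        for year, tag_scores in scores.items()
    }


ranks = yearly_tfidf_ranks(questions)
for year, ordered in ranks.items():
    print(year, ordered)
```

With this weighting, a tag that appears on every question (here `python`) gets an IDF of zero and falls to the bottom, which is why more specific tags such as django or pandas surface at the top of each year.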
# Visualization of top features per topic
import matplotlib.colors
df_excel = pd.read_excel('svd_topics.xlsx',
                         sheet_name="topics").iloc[:, 1:].dropna(how='all')
# Two-color map built from the Python logo colors
CustomCmap = matplotlib.colors.ListedColormap(['#FFD43B', '#306998'])
fig, axs = plt.subplots(4, 3, figsize=(18, 14))
axs = axs.ravel()
# One word cloud per derived topic (the first 12 columns of the sheet)
for i, col in enumerate(df_excel.columns[:12]):
    # Cells are strings shaped like ('tag', score); strip the tuple
    # punctuation literally (regex=False) before splitting on the comma
    df = df_excel[col].str.replace('(', '', regex=False)
    df = df.str.replace(')', '', regex=False)
    df = df.str.replace("'", '', regex=False)
    df = df.str.split(',').dropna()
    # Map each tag to its feature importance
    dct = {}
    for a, x in df.values:
        dct[a] = float(x)
    wordcloud = (WordCloud(background_color="white",
                           min_font_size=8,
                           colormap=CustomCmap,
                           prefer_horizontal=1)
                 .generate_from_frequencies(frequencies=dct))
    axs[i].imshow(wordcloud, interpolation='bilinear')
    axs[i].set_title(str(col), fontsize=15)
    axs[i].axis("off")
plt.suptitle("Top Features for Each Topic", fontsize=20, y=0.925)
plt.show()
For every year, each of the top 6 singular values (SV) was assigned a topic based on the tags with the highest feature importance. Topics with similar features across the years were then combined, resulting in 12 derived topics.
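To make the step from singular values to topics more concrete, here is a minimal sketch that decomposes a small question-by-tag weight matrix and lists the strongest tags of each leading component. It uses `numpy.linalg.svd` on a toy matrix; the matrix values, the tag list, and the `top_features` helper are assumptions for illustration, and the notebook's own decomposition (run per year, keeping 6 components) may differ in its details.

```python
import numpy as np

tags = ["django", "google-app-engine", "pandas",
        "dataframe", "tensorflow", "keras"]

# Hypothetical weight matrix: rows are questions, columns are tags
X = np.array([
    [1.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.5],
])


def top_features(X, tags, n_components=3, n_top=2):
    """Return (singular value, [(tag, |weight|), ...]) per leading component."""
    # Rows of Vt are the right singular vectors, one weight per tag
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    topics = []
    for sv, component in zip(s[:n_components], Vt[:n_components]):
        # The sign of a singular vector is arbitrary, so rank by magnitude
        order = np.argsort(-np.abs(component))[:n_top]
        features = [(tags[j], float(abs(component[j]))) for j in order]
        topics.append((float(sv), features))
    return topics


topics = top_features(X, tags)
for sv, features in topics:
    print(round(sv, 3), [tag for tag, _ in features])
```

In this toy case each component is dominated by one group of co-occurring tags, and a human-readable topic name (for example Data Transformation for pandas and dataframe) would then be assigned by inspecting those top tags.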
It is interesting to note that while the majority of the topics identified followed the top words in TF-IDF, some words did not follow this trend. We can see that arrays and tkinter are important features in their respective SVs despite not being among the top words in TF-IDF. Conversely, flask is a tag that is in the top ten for each year but was not a strong feature for any of the SVs.
Figure 3 visualizes the results, where the size of each word corresponds to its feature importance. The 12 topics can be summarized as:
- Generic Python: comprises mostly the tag python as well as python-2.7 and python-3.x
- Django Framework: comprises different variations of the word django
- Web Apps: consists mostly of google-app-engine and google-cloud-datastore
- Python Data Types: has list and dictionary as its most important features, together with strings and loops
- Web Scraping: has regex as its most important feature and also contains the tags string, parsing, and beautiful-soup
- GUI Tools: has windows, tkinter, and linux
- Python-2.7: is dominated by the python-2.7 tag but also has python-3.x
- Computations Tools: contains numpy, arrays, and scipy as well as matplotlib and pandas
- Python-3.x: is similar to Python-2.7 but dominated by the python-3.x tag instead
- Data Transformation: contains the tags pandas, numpy, and dataframe, as well as group-by, excel, and csv
- Deep Learning: highlights tensorflow and keras as its most important features
- Machine Learning: has similar words to Deep Learning but highlights numpy, arrays, and tensorflow as well
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
df_topic_values = pd.read_csv('word_topic_svd_values.csv')
# Set up interactive highlighting
fig = go.FigureWidget()
fig.layout.hovermode = 'closest'
fig.layout.hoverdistance = -1 #ensures no "gaps" for selecting sparse data
# Plot Scatter Plots per Topic
for t in df_topic_values.topic.unique():
    subdf_topic_values = df_topic_values[df_topic_values.topic == t]
    fig.add_trace(
        go.Scatter(x=subdf_topic_values.topic,
                   y=subdf_topic_values.words,
                   mode='markers',
                   name=t,
                   marker=dict(
                       size=subdf_topic_values.score,
                       sizemode='area',
                       # scale so the largest marker is about 40 px across
                       sizeref=2*max(subdf_topic_values.score)/(40**2),
                       sizemin=1,
                       opacity=0.6,
                       color='#FFD43B',
                       line=dict(width=1,
                                 color='#FFC622')
                   ),
                   hovertemplate="Topic : %{x}<br>" +
                                 "Tag : %{y}<br>" +
                                 "Value : %{marker.size}"
                   )
    )
# Stylize Figure
fig.update_layout(
xaxis={'side': 'top',
'tickangle' : -30},
width=1000,
height=1500,
template='none',
yaxis_title="<b>Tags</b>",
legend_title="<b>Derived Topics</b>",
title={
'text': "<b>Contribution of Derived SVD Topics</b>",
'y':1,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
yaxis_range=[-1,len(df_topic_values.words.unique())],
xaxis_range=[-1,len(df_topic_values.topic.unique())],
)
# Initially Highlight Data Transformation
fig.data[9].marker['color'] = '#306998'
fig.data[9].marker['line']['color'] = '#0A6F99'
fig.data[9].marker['line']['width'] = 3
def update_trace(trace, points, selector):
    """Change color of selected trace and reset to base color of previously
    selected traces"""
    # points.point_inds lists the clicked points of this trace;
    # it is empty for every trace except the one that was clicked
    if len(points.point_inds) == 0:
        trace.marker['color'] = '#FFD43B'
        trace.marker['line']['color'] = '#FFC622'
        trace.marker['line']['width'] = 1
        return
    trace.marker['color'] = '#306998'
    trace.marker['line']['color'] = '#0A6F99'
    trace.marker['line']['width'] = 3
# Enable On Click Events for each trace
for i in range(len(fig.data)):
    fig.data[i].on_click(update_trace)
fig.show()
Figure 4 visualizes the saliency of the top tags of each derived SVD topic, where the size of each marker indicates the projection of that tag on the topic. From the visual above, we can see that most topics have at least one dominant tag with a value of around 0.8. For example, in the topic of Data Transformation the top component is the tag pandas, with a value of ~0.79. There are also topics with multiple noticeable components, such as Deep Learning, where the main tags are tensorflow with a value of 0.80 and keras with a value of 0.46. Combining the visual information from Figure 3 and Figure 4, we can see that there are distinct topics that have been continuously discussed on Stack Overflow throughout the last decade.
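One hedged way to read these values is through their squares. If each topic vector is unit-length, as singular vectors produced by SVD are, the squared values of its tags sum to at most 1, so a squared value can be read as the share of the topic carried by that tag. Whether the plotted scores are raw unit-length components is an assumption here, and `squared_share` is a hypothetical helper; the two input numbers are the ones quoted above for Deep Learning.

```python
# Values quoted in the text for the Deep Learning topic
deep_learning = {"tensorflow": 0.80, "keras": 0.46}


def squared_share(loadings):
    """Square each value; for a unit-length vector these sum to at most 1."""
    return {tag: round(value ** 2, 4) for tag, value in loadings.items()}


shares = squared_share(deep_learning)
total = round(sum(shares.values()), 4)
print(shares)  # {'tensorflow': 0.64, 'keras': 0.2116}
print(total)   # 0.8516
```

Under that assumption, tensorflow and keras together would account for about 85% of the Deep Learning topic, with the remaining tags sharing the rest.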
df_topics = pd.read_csv('topics_ranks.csv')
highlight = alt.selection(type='single', on='mouseover', bind='legend',
fields=['Topic'], nearest=True, empty="none")
chart = alt.Chart(df_topics,
title="Top Topics Across the Years").mark_line().encode(
x=alt.X('Year:O'),
y=alt.Y('SV',
scale=alt.Scale(domain=(7.5, 0.5), clamp=True),
axis=alt.Axis(tickCount=6, titleAngle=0, titlePadding=15,
labelColor=alt.condition('datum.value > 6',
alt.value('white'),
alt.value('black')))),
color=alt.Color('Topic',
legend=alt.Legend(title="Topics")),
strokeDash=alt.StrokeDash('Topic',
legend=None),
#color=alt.condition(highlight, 'Topic', alt.value("lightgray")),
size=alt.condition(~highlight, alt.value(1), alt.value(5)),
tooltip=['SV','Topic'],
).properties(
width=780,
height=400
)
points = alt.Chart(df_topics).mark_point(opacity=1).encode(
x=alt.X('Year:O'),
y='SV',
color=alt.Color("Topic", legend=None)).add_selection(
highlight
)
layer = chart + points
layer.configure_axis(
grid=False
).configure_title(
fontSize=18,
offset=5,
orient='top',
anchor='middle')
As seen in Figure 5, the top topics related to Python programming have changed over the last decade. Because Generic Python questions were expected to outrank the more specialized topics, we excluded Generic Python from our analysis of this figure.
Noticeably, at the onset of the decade, Python developers on Stack Overflow mainly explored Python as a means of web development, as topics such as Django Framework and Web Apps ranked within the top 4 topics from 2010 until 2012. By 2014, however, a noticeable jump in popularity was observed for topics involving Python 2.7, which might be due to two adjacent events: the end of life of Python 2.6 in October 2013 and the release of Python 3.4 in March 2014. These might have led to a mass migration of Python programmers to newer versions of the language. The jump may also be related to the significant differences in syntax between Python 2.X and Python 3.X, which may have caused a multitude of errors for programmers and Python developers alike. A similar pattern can be observed in the latter part of our visualization, where Python 3.X had completely overtaken Python 2.7 by 2016 and Python 2.7 had dropped out of the top 5 topics by 2019, as the announcement of its end of life in 2020 led newer Python enthusiasts to prefer Python 3.X.
Another observation that we can derive from Figure 5 is the rise of interest in Machine Learning, Deep Learning, and Data Transformation in the latter part of the decade, which we may attribute to the sudden rise in popularity of data science using the Python programming language.
It is evident from the data that Python has not only grown more popular but also diversified over the years. The creator of Python, Guido van Rossum, was himself surprised at the popularity that Python has gained, stating that “I certainly didn’t set out to create a language that was intended for mass consumption”. What was, in the early 2010s, a tool used heavily for web development is now being used for many other purposes.
Through the use of term frequency-inverse document frequency (TF-IDF) and singular value decomposition (SVD), we were able to identify the top tags used on Stack Overflow, group them together, and assign them topics according to their features. This allowed us to track the major uses of Python over a span of 10 years.
Our analysis showed that Python is an evolving programming language. Although Django is still one of the most popular Python-related tags on Stack Overflow today, it enjoys the company of many other Python packages. Computational tools such as numpy and data transformation tools such as pandas have been consistent top tags since 2015. Within the past two years, Python has evolved further, with tensorflow, keras, machine-learning, and deep-learning becoming popular tags. It will be interesting to observe how the trend continues in the coming years, but if the current trend holds, Python will continue to grow and diversify for use by even more people. If this occurs, it becomes all the more important to ensure that Stack Overflow can keep providing the quality assistance it currently gives to beginners and experienced programmers.